Collision handling apparatus and method

ABSTRACT

The present invention relates to mechanisms for handling and detecting collisions between threads (5, 6, 7) that execute computer program instructions out of program order. According to an embodiment of the present invention, each of a plurality of threads (5, 6, 7) is associated with a respective data structure (9, 10, 11) comprising a number of bits (12) that correspond to memory elements (m0, m1, m2, mn) of a shared memory (4). When a thread accesses a memory element in the shared memory, it sets a bit in its associated data structure, which bit corresponds to the accessed memory element. This indicates that the memory element has been accessed by the thread. Collision detection may be carried out after the thread has finished executing by means of comparing the data structure of the thread with the data structures of other threads on which the thread may depend.

FIELD OF THE INVENTION

The present invention relates in general to execution of computer program instructions, and more specifically to thread-based speculative execution of computer program instructions out of program order.

BACKGROUND OF THE INVENTION

The performance of computer processors has been tremendously enhanced over the years. This has been achieved both by means of making operations faster and by means of increasing the parallelism of the processors, i.e. the ability to execute several operations in parallel. Operations can for instance be made faster by means of improving transistors to make them switch faster or optimizing the design to minimize the level of logic needed to implement a given function. Techniques for parallelism include processing computer program instructions concurrently in multiple threads. There are programs that are designed to execute in several concurrent threads, but a program that is designed to execute in a single thread can also be executed in several concurrent threads. If the execution of a program in several concurrent threads causes program instructions to be executed in an order that differs from the program order in which the program was designed to execute, the thread execution is speculative. The discussion hereinafter focuses on such speculative thread execution.

A computer program that has been designed to be executed in a single thread can be parallelised by dividing the program flow into multiple threads and speculatively executing these threads concurrently, usually on multiple processing units. The international patent application WO00/29939 describes techniques that may be used to divide a program into multiple threads.

However, if the threads access a shared memory, collisions between the concurrently executed threads may occur. A collision is a situation in which the threads access the shared memory in such a way that there is no guarantee that the semantics of the original single-threaded program is preserved.

A collision may occur when two concurrent threads access the same memory element in the shared memory. An example of a collision is when a first thread writes to a memory element and the same memory element has already been read by a second thread which follows the first thread in the program flow of the single-threaded program. If the write operation performed by the first thread changes the data in the memory element, the second thread will have read the wrong data, which may give a result of program execution that differs from the result that would have been obtained if the program had been executed in a single thread. Depending on the implementation, collisions can for example also occur when two threads write to the same memory element in the shared memory.

Execution of a computer program in multiple concurrent threads is intended to speed up program execution, without altering the semantics of the program. It is therefore of interest to provide a mechanism for detecting collisions. When a collision has been detected, one or more threads can be rolled back in order to make sure that the semantics of the single-threaded program is preserved. A rollback involves restarting a thread at an earlier point in execution, and undoing everything that has been done by the thread after that point. In the example above, in which the older first thread wrote to a memory element that already had been read by the younger second thread, the second thread should be rolled back, at least to the point when the memory element was read, if it is to be guaranteed that the semantics of the single-threaded program is preserved.

A known mechanism for detecting and handling collisions involves keeping track of accesses to memory elements by means of associating two or more flag bits per thread with each memory object. One of these flag bits is used to indicate that the memory object has been read by the thread, and another bit is used to indicate that the memory object has been modified by the thread.

The international patent application WO 00/70450 describes an example of such a known mechanism. Before a primary thread writes to a memory element in a shared memory, status information associated with the memory element is checked to see if a speculative thread has read the memory element. If so, the speculative thread is caused to roll back so that the speculative thread can read the result of the write operation.

A disadvantage of this known mechanism when implemented in software is that it results in a large execution overhead due to the communication and synchronization between the threads that is required for each access to the shared memory. The status information is accessible to several threads and a locking mechanism is therefore required in order to make sure that errors do not occur due to concurrent access to the same status information by two threads. There is also a need for memory barriers (also called memory fences) in order to ensure correct ordering between accesses to the shared memory and accesses to the status information.

Another example of a known mechanism for detecting and handling collisions is described in Steffan J. G. et al., “The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization”, Proceedings of the Fourth International Symposium on High-Performance Computer Architecture, February 1998, and in Oplinger J. et al., “Software and Hardware for Exploiting Speculative Parallelism with a Multiprocessor”, Stanford University Computer Systems Lab Technical Report CSL-TR-97-715, February 1997. An extended cache coherency protocol is used to support speculative threads.

The flag bits are, according to this technique, associated with cache lines in a first level cache of each of a plurality of processors. When a thread performs a write operation, a standard cache coherency protocol invalidates the affected cache line in the other processors. By extending the cache coherency protocol to include the thread number in the invalidation request, the other processors can detect read after write dependence violations and perform rollbacks if necessary. A disadvantage of this approach is that speculatively accessed cache lines have to be kept in the first level cache until the speculative thread has been committed, otherwise the extra information associated with each cache line is lost. If the processor runs out of available positions in the first level cache during execution of the speculative thread, the speculative thread has to be rolled back. Another disadvantage is that the method requires modifications to the cache coherency protocol implemented in hardware, and cannot be implemented purely in software using standard microprocessor components.

SUMMARY OF THE INVENTION

As mentioned above, the known mechanisms for handling and detecting collisions have some disadvantages. The problem solved by the present invention is to provide mechanisms that simplify handling and detection of collisions.

A first object of the present invention is to provide a device having simplified mechanisms for recording information regarding memory accesses to a shared memory.

A second object of the present invention is to provide a simplified method for recording information regarding memory accesses to a shared memory.

A third object of the present invention is to provide a simplified method for handling possible collisions between a plurality of threads.

The objects of the present invention are achieved by means of an apparatus according to claim 1, by means of a method according to claim 17 and by means of a method according to claim 27. The objects of the invention are further achieved by means of computer program products according to claim 36 and claim 37.

According to the present invention, each of a plurality of threads is associated with a respective data structure for storing information regarding accesses to the memory elements of the shared memory. When a thread accesses a selected memory element in the shared memory, information is stored in its associated data structure, which information is indicative of the access to the selected memory element. According to an embodiment of the present invention, collision detection is carried out after the thread has finished executing by means of comparing the data structure of the thread with the data structures of other threads on which the thread may depend.

An advantage of the present invention is that each thread is associated with a respective data structure that stores the information indicative of the accesses to the shared memory. This is especially advantageous in a software implementation since each thread will only modify the data structure with which it is associated. The threads will read the data structures of other threads, but they will only write to their own associated data structure according to the present invention. The need for locking mechanisms is therefore reduced compared with the known solutions discussed above, in which the information indicative of memory accesses was associated with the memory elements of the shared memory and was modified by all the threads. The reduced need for locking mechanisms reduces the execution overhead and makes the implementation simpler. In the software implementation, the absence of locks and memory barriers during thread execution will also give a compiler more freedom to optimize the code.

Another advantage of the present invention is that, since it does not require a modified cache coherency protocol, it can be implemented purely in software, thus making it possible to implement the invention using standard components.

Further advantages of embodiments of the present invention will be apparent from the following detailed description of preferred embodiments with reference to accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic block diagram of a computer system in which the present invention is used.

FIGS. 2A and 2B are schematic diagrams that illustrate a computer program being executed in a single thread and divided into several threads respectively.

FIG. 3A is a schematic block diagram that illustrates how data structures according to the present invention are used.

FIG. 3B is a schematic block diagram that illustrates how an alternative embodiment of data structures according to the present invention is used.

FIG. 4 is a flow diagram illustrating how reading from the shared memory may be performed according to the present invention.

FIG. 5 is a flow diagram illustrating how writing to the shared memory may be performed according to the present invention.

FIG. 6 is a schematic block diagram that illustrates dependence lists associated with threads according to the present invention.

FIG. 7 is a flow diagram illustrating how a thread may be executed and a collision check for the thread may be made according to the present invention.

DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS

FIG. 1 illustrates a computer system 1 including two central processing units (CPUs), a first CPU 2 and a second CPU 3. The CPUs access a shared memory 4, divided into a number of memory elements m0, m1, m2, mn. The memory elements may for instance be equal to a cache line or may alternatively correspond to a variable or an object in a source language. FIG. 1 also shows three threads 5, 6, 7 executing on the CPUs 2, 3.

A thread can be seen as a portion of computer program code that is defined by two checkpoints, a start point and an end point. FIG. 2A shows a schematic illustration of a computer program 8 comprising a number of instructions or operations, i1, i2, ..., in. When the computer program is executed as a single thread, the normal way of processing the instructions is in the program order, i.e. from top to bottom in FIG. 2A. It is however possible, according to known techniques as mentioned above, to divide the program into multiple threads. The program 8 may for instance be divided into the three threads 5, 6, 7 as indicated in FIG. 2A. The threads can be executed concurrently. FIG. 2B illustrates an example of a threaded program flow, where the first CPU 2 first processes the thread 5 and then the thread 6, and the second CPU 3 starts processing thread 7 before the threads 5 and 6 have finished executing on the first CPU 2.

FIG. 2B shows an example of how the threads 5, 6, 7 may execute. Many other alternative ways of executing the threads are however possible. It is for instance not necessary that the first CPU 2 finishes processing the thread 5 before starting on the thread 6, and the thread 6 may be executed before the thread 5. The first CPU 2 may be a type of processor that is able to switch between several different threads such that the CPU 2 e.g. starts processing the thread 5, leaves the thread 5 before it is finished to process the thread 6 and then returns to the thread 5 again to continue where it left off. Such a processor is sometimes called a Fine Grained Multi-Threading Processor. A Simultaneous Multi-Threading (SMT) Processor is able to process several threads in parallel, so if the CPU 2 is such a processor it is able to process the threads 5, 6 simultaneously.

Thus, it is not necessary to have multiple CPUs in order to process multiple threads concurrently.

Collisions may occur between the threads 5, 6, 7 when the instructions of the computer program 8 are executed out of program order. As mentioned above, a collision is a situation in which the threads access the shared memory 4 in such a way that there is no guarantee that the semantics of the original single-threaded program 8 is preserved. It is therefore of interest to provide mechanisms for detecting and handling collisions that may arise during speculative thread execution.

According to the present invention, each thread 5, 6, 7 is associated with a data structure 9, 10, 11, which is illustrated schematically in FIG. 1. The data structure is used to store information indicative of which memory elements in the shared memory 4 the respective thread has accessed. According to an embodiment of the present invention, each data structure includes a number of bits 12 that correspond to the memory elements in the shared memory. According to the embodiment of the present invention shown in FIG. 1, the bits 12 of each data structure 9, 10, 11 are divided into a load vector 9a, 10a, 11a and a store vector 9b, 10b, 11b. For each memory element m0, m1, m2, mn in the shared memory 4, there is exactly one corresponding bit 12 in the load vector and exactly one corresponding bit 12 in the store vector associated with each thread. When the thread 5 reads from a memory element, it sets the corresponding bit 12 in the load vector 9a to indicate that the memory element has been read. The store vector 9b is updated analogously when the thread 5 writes to the shared memory.

There can either be a one-to-one correspondence or a many-to-one correspondence between the memory elements and the bits in the load and store vectors. By having a many-to-one correspondence, the memory overhead is reduced at the cost of spurious collisions, which cause slower execution. Reducing the memory overhead will however also result in reduced execution overhead, since there will be fewer cache misses. A hash function can be used to map a number of a memory element to a bit position in the load and store vectors.

FIG. 3A illustrates an example of how the data structures 9, 10, 11 are used according to the present invention. In this example the thread 5 has written to the memory elements m1 and m4 and read memory elements m1, m5 and m8. The thread 6 has written to the memory elements m2, m6 and m9 and read the memory elements m2, m6 and m13. The thread 7 has read the memory element m12. In this example, there are more memory elements in the shared memory than there are bit positions in the load and store vectors, which means that there is a many-to-one correspondence between the memory elements and the bits in the load and store vectors. In this example the bit position in the load and store vector that corresponds to a selected memory element is found using a hash function, which in this example simply calculates the remainder when dividing the number of the memory element by the size of the load and store vectors. This means that when the thread 5 writes to the memory element m1, it sets the bit in position number 1 in its store vector, and when the thread 6 writes to the memory element m9, it sets the bit in position number 1 in its store vector. When the threads have performed the write and read operations mentioned above, the bit position numbers that are set will be 0, 1, 5 for the load vector 9a; 1, 4 for the store vector 9b; 2, 5, 6 for the load vector 10a; 1, 2, 6 for the store vector 10b and 4 for the load vector 11a. This is illustrated in FIG. 3A by means of filled boxes representing the bits that are set.
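The bit positions of this example can be reproduced with a short sketch. It is illustrative only: the vector size of 8 is an assumption inferred from the listed bit positions (the text does not state the size), each vector is modelled as a Python integer used as a bit set, and the names VECTOR_SIZE, bit_position, mark and set_positions are hypothetical.

```python
VECTOR_SIZE = 8  # assumption: inferred from the bit positions listed for FIG. 3A


def bit_position(element_number, size=VECTOR_SIZE):
    """Hash of the example: remainder of the element number divided by the vector size."""
    return element_number % size


def mark(vector, element_number):
    """Return the vector with the bit for the given memory element set."""
    return vector | (1 << bit_position(element_number))


def set_positions(vector):
    """List the bit positions that are set, lowest first."""
    return [i for i in range(VECTOR_SIZE) if (vector >> i) & 1]


# Memory element numbers accessed by each thread, as given in the example.
accesses = {
    5: {"load": [1, 5, 8], "store": [1, 4]},
    6: {"load": [2, 6, 13], "store": [2, 6, 9]},
    7: {"load": [12], "store": []},
}

# Build one load vector and one store vector per thread.
vectors = {}
for thread, by_kind in accesses.items():
    vectors[thread] = {}
    for kind, elements in by_kind.items():
        vector = 0
        for element in elements:
            vector = mark(vector, element)
        vectors[thread][kind] = vector
```

Under these assumptions, set_positions(vectors[5]["load"]) lists positions 0, 1 and 5, which agrees with the load vector 9a described above.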

The implementation of the present invention can be simplified by means of the data structures 9, 10, 11 each comprising a single combined load and store vector instead of a separate load vector and a separate store vector. FIG. 3B illustrates the same example as described above with reference to FIG. 3A, with the only difference that the data structures 9, 10, 11 each include a single combined load and store vector 9c, 10c, 11c instead of the load vectors 9a, 10a, 11a and the store vectors 9b, 10b, 11b. The bit positions that are set in the combined load and store vector 9c correspond to a logical bitwise inclusive OR operation of the load vector 9a and the store vector 9b shown in FIG. 3A.

The embodiment of the present invention wherein the data structures include a single combined load and store vector results in an increased number of spurious collisions, but on the other hand it also results in a reduced need for memory to store the data structures and a reduced number of operations when checking for collisions, as will be discussed further below.
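As a rough illustration of the combined vector, the sketch below derives the vectors 9c, 10c and 11c from the bit positions listed for FIG. 3A. The helper names to_vector and to_positions are hypothetical, and modelling a vector as a Python integer is an assumption of the sketch. The store vector of the thread 7 is taken to be empty because that thread only reads in the example.

```python
def to_vector(positions):
    """Build an integer bit set from a list of bit positions."""
    vector = 0
    for position in positions:
        vector |= 1 << position
    return vector


def to_positions(vector):
    """List the set bit positions of an integer bit set, lowest first."""
    return [i for i in range(vector.bit_length()) if (vector >> i) & 1]


# Bit positions as listed in the text for FIG. 3A.
load_9a, store_9b = to_vector([0, 1, 5]), to_vector([1, 4])
load_10a, store_10b = to_vector([2, 5, 6]), to_vector([1, 2, 6])
load_11a, store_11b = to_vector([4]), to_vector([])

# Combined load and store vectors: bitwise inclusive OR of the two vectors.
combined_9c = load_9a | store_9b
combined_10c = load_10a | store_10b
combined_11c = load_11a | store_11b
```

With these inputs the combined vector 9c has bits 0, 1, 4 and 5 set, and the combined vector 10c has bits 1, 2, 5 and 6 set. The information about whether a bit came from a read or a write is lost, which is why the combined form can report more spurious collisions.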

The embodiments of the present invention shown in FIGS. 3A and 3B use a type of data versioning called privatisation, which means that a private copy 14 of a memory element that is to be modified is created for the thread that modifies the element. The thread then modifies the private copy instead of the original memory element in the shared memory. The private copies contain pointers 15 to their corresponding original memory element in the shared memory. The private copies are used to write over the original memory elements in the shared memory 4 when the threads for which they were created are committed. If a thread is rolled back, its associated private copies 14 are discarded. FIG. 4 shows a flow diagram illustrating how reading from the shared memory is performed when privatisation is used. FIG. 5 shows a corresponding flow diagram for writing to the shared memory.

FIG. 4 shows a first step 20, wherein the memory element to be read is marked as read in the load vector. In step 21, it is examined whether or not the thread has a private copy of the memory element to be read. If a private copy exists, the data is read from the private copy, step 22. If there is no private copy, the data is read from the memory element in the shared memory, step 23.

FIG. 5 shows a first step 25, wherein it is examined whether or not the thread has a private copy of the memory element to be written to. If there is no private copy, the memory element to be written to is marked as written in the store vector, step 26, and a private copy is created, step 27. The data is then written to the private copy, step 28. If a private copy is found to exist in step 25, the data can be written to the private copy directly, step 28, without having to make a mark in the store vector or create the private copy.
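The read and write flows of FIGS. 4 and 5 can be sketched as follows. This is a minimal single-process model under stated assumptions, not the claimed implementation: the class name SpeculativeThread and its methods are hypothetical, the shared memory is modelled as a Python list, the vectors are integers with an assumed size of 8 bits, and the dictionary key plays the role of the pointer 15 from a private copy to its original element.

```python
class SpeculativeThread:
    """Hypothetical model of one thread using privatisation."""

    def __init__(self, shared, vector_size=8):
        self.shared = shared        # shared memory, modelled as a list
        self.size = vector_size     # assumed fixed vector size
        self.load_vector = 0
        self.store_vector = 0
        self.private = {}           # element number -> private copy

    def _bit(self, number):
        return 1 << (number % self.size)

    def read(self, number):
        self.load_vector |= self._bit(number)       # step 20: mark as read
        if number in self.private:                  # step 21: private copy?
            return self.private[number]             # step 22
        return self.shared[number]                  # step 23

    def write(self, number, value):
        if number not in self.private:              # step 25: private copy?
            self.store_vector |= self._bit(number)  # step 26: mark as written
            self.private[number] = self.shared[number]  # step 27: create copy
        self.private[number] = value                # step 28: write to the copy

    def commit(self):
        # Private copies overwrite the originals. The vectors are kept, since
        # threads that may depend on this thread still need to read them.
        for number, value in self.private.items():
            self.shared[number] = value
        self.private.clear()

    def rollback(self):
        # Private copies are discarded. Resetting the vectors for the restart
        # is an assumption of this sketch.
        self.private.clear()
        self.load_vector = self.store_vector = 0


shared = [10, 20, 30, 40]
t = SpeculativeThread(shared)
t.write(1, 99)
value_seen = t.read(1)            # the thread sees its own private copy
shared_before_commit = shared[1]  # the shared memory is still unchanged
t.commit()
```

In this model the shared memory is only modified at commit time, so a rollback never has to restore anything in the shared memory.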

The privatisation described above is not a prerequisite of the present invention. In another type of data versioning, which may be used instead of privatisation, the threads store backup copies of the memory elements before they modify them. These backup copies are then copied back to the shared memory during a rollback.
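The backup-copy alternative can be sketched in the same style. The class name BackupThread is hypothetical, the shared memory is again modelled as a list, and keeping only the first backup of each element (so that a rollback restores the value from before the thread's first modification) is an assumption of the sketch rather than something the text specifies.

```python
class BackupThread:
    """Hypothetical model of one thread that writes in place and keeps backups."""

    def __init__(self, shared):
        self.shared = shared   # shared memory, modelled as a list
        self.backups = {}      # element number -> value before first modification

    def write(self, number, value):
        if number not in self.backups:
            # Save the original value once, before the first modification.
            self.backups[number] = self.shared[number]
        self.shared[number] = value   # modify the shared memory directly

    def rollback(self):
        # Copy the backup copies back to the shared memory.
        for number, old_value in self.backups.items():
            self.shared[number] = old_value
        self.backups.clear()

    def commit(self):
        # The modifications are already in place, so only the backups are dropped.
        self.backups.clear()


shared = [10, 20, 30]
t = BackupThread(shared)
t.write(0, 11)
t.write(0, 12)
in_place_value = shared[0]   # the write is visible in the shared memory
t.rollback()
```

Compared with privatisation, this moves the copying cost from commit to rollback, which may suit workloads where rollbacks are rare.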

The embodiments of the present invention described above comprise data structures in the form of bit vectors for storing information indicative of the thread's accesses to the memory. However, many alternative types of data structures for storing this information are possible according to the present invention. The data structures may for instance be implemented as lists to which numbers that correspond to the memory elements are added to indicate accesses to the memory elements. Other possible implementations of the data structures include trees, hash tables and other representations of sets.

It will now be discussed how the thread-associated data structures of the present invention can be used to check for and detect collisions.

In a software implementation where the thread-associated data structures of the present invention are used to check for collisions, a thread that has collided with another thread will itself detect the collision. In the known mechanisms discussed above, an older thread would detect if a younger thread has collided and send a message about this so that the younger thread would be rolled back. This sending of messages takes time and causes an extra delay, which can be avoided by means of the present invention.

According to a preferred embodiment of the present invention, collision checks are performed after the thread has finished its execution and is about to be committed. The collision check is made by means of comparing the data structure associated with the thread to be checked with the data structures associated with other threads on which the thread to be checked may depend. In order to keep track of the possible dependencies between threads, a dependence list may be created for each thread before it starts executing. This is illustrated in FIG. 6, by means of the threads 5, 6, 7 which are associated with dependence lists 16, 17 and 18 respectively. The dependence lists are lists of all older threads that had not yet been committed when the thread was about to start executing. The thread 7 may depend on threads 5 and 6, so its dependence list 18 contains references to threads 5 and 6 to indicate the possible dependency.

The dependence list described above is just an example of how to keep track of possible dependencies between threads. The dependence list is not limited to a list structure but can also be represented as an alternative structure that can store information regarding possible dependencies. It is further not necessary for the dependence list to store a reference to all older not yet committed threads. For example, in an implementation where forwarding is used it may be possible to determine that the thread to be started is not dependent on some of the older not yet committed threads, and it is then not necessary to store a reference to these threads in the dependence list. In other cases the information stored in the dependence list may refer to an interval of threads of which some already have been committed when the dependence list is created. As long as the dependence list includes a reference to all the threads that the thread to be started depends on, there is no harm in the dependence list also including references to some threads that the thread to be started clearly does not depend on.
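One simple way to build such a dependence list is sketched below. The function name create_dependence_list is hypothetical, and the sketch assumes that thread identifiers increase in program order, so that a smaller identifier means an older thread. It also assumes, for the sake of the example, that none of the threads 5, 6, 7 has been committed when the lists are created.

```python
def create_dependence_list(new_thread, uncommitted_threads):
    """Return all older, not yet committed threads.

    Assumes identifiers increase in program order (smaller means older).
    """
    return [thread for thread in uncommitted_threads if thread < new_thread]


# Hypothetical situation: threads 5, 6 and 7 are all uncommitted.
uncommitted = [5, 6, 7]
dependence_lists = {
    thread: create_dependence_list(thread, uncommitted) for thread in uncommitted
}
```

For the thread 7 this yields references to the threads 5 and 6, in line with the dependence list 18 described above. Including a thread that is not actually a dependency would only cost an extra comparison, whereas omitting a real dependency could let a collision go undetected.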

FIG. 7 shows a flow diagram of how a thread may be executed and a collision check for the thread may be made according to the present invention. In a step 30, the dependence list for the thread to be executed is created. The thread is then executed in a step 31. When the thread has finished executing, it waits until the threads that it may depend on have been checked for collisions and are ready to be committed, step 32. It then compares its associated data structure to the data structures associated with the threads in the dependence list to check for collisions, step 33. If no collision is detected, the thread is committed in a step 34, otherwise the thread is rolled back in a step 35. If the thread has collided with another thread, the risk that the thread collides with the same thread again may be reduced by means of delaying the restart of the thread until the thread it collided with has been committed. The system may be arranged to give higher priority to committing threads with which other threads have collided.
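The flow of FIG. 7 can be outlined in a sequential sketch. Everything here is an assumption made for illustration: the names collides and run_thread are hypothetical, the access records are Python sets of memory element numbers (one of the alternative set representations mentioned earlier) rather than bit vectors, the threads "A", "B" and "C" and their accesses are invented, and the threads are run one at a time in program order so that the wait in step 32 is trivially satisfied. Restarting a rolled-back thread is left out.

```python
def collides(older, younger):
    """True if the older thread wrote an element the younger thread read or wrote."""
    return bool(older["writes"] & (younger["reads"] | younger["writes"]))


def run_thread(name, body, dependence_list, records, log):
    # Step 30 is assumed to be done by the caller, which passes the dependence list.
    records[name] = body()   # step 31: execute and record the accesses
    # Step 32: nothing to wait for, since older threads have already been checked.
    if any(collides(records[older], records[name]) for older in dependence_list):
        log.append((name, "rolled back"))   # step 35 (steps 33 and 35)
        return False
    log.append((name, "committed"))         # step 34
    return True


records, log = {}, []
# Hypothetical threads in program order; each body returns its recorded accesses.
run_thread("A", lambda: {"reads": set(), "writes": {3}}, [], records, log)
run_thread("B", lambda: {"reads": {3}, "writes": set()}, ["A"], records, log)
run_thread("C", lambda: {"reads": {7}, "writes": {8}}, ["A", "B"], records, log)
```

Here the thread "B" read element 3, which the older thread "A" wrote, so "B" is rolled back, while "A" and "C" are committed.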

When the collision check is performed as described above, even the oldest not yet committed thread is speculative, since it might have collided with an earlier thread that already has been committed and this is not detected until the thread has finished its execution. However, when a thread has become the oldest not yet committed thread, it will have to be rolled back at the most once, since when it is restarted, there is no other thread that it can collide with.

Alternatively, one or several partial collision checks may be performed during execution, before performing the collision check when the thread has finished executing. The partial collision check can be performed without locking the data structures associated with other threads because it is acceptable that the partial check fails to detect some collisions. Collisions that were not detected in the partial collision check will be detected in the final collision check that is performed after the thread has finished its execution.

The comparison between two data structures to detect collisions is performed differently depending on whether the data structures include separate load and store vectors or a combined load and store vector. If the data structures have separate load and store vectors, the comparison between the load and store vectors of an older and a younger thread can be carried out by means of performing the following logical operations bitwise on the bit vectors: old store vector AND (young store vector OR young load vector).

If the resulting vector contains any bits that are set, there is a collision and the younger thread should be rolled back. If the data structures have combined load and store vectors, the corresponding logical operation to be performed to check for collisions is an AND operation between the combined vector of the older thread and the combined vector of the younger thread.

In an alternative embodiment the comparison to detect collisions is carried out by means of performing the following logical operation bitwise on the bit vectors: old store vector AND young load vector.

This comparison assumes that the threads are committed in program order and that, when a write operation that only modifies part of a memory element (which corresponds to a read-modify-write operation) is carried out, the corresponding bit in both the load and the store vector is set.
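The three comparisons can be tried out on the bit positions listed for FIG. 3A, with the thread 5 as the older thread and the thread 6 as the younger thread. The sketch is illustrative only: the function names are hypothetical and the vectors are written as Python integer literals, with bit 0 as the rightmost digit.

```python
def check_separate(old_store, young_load, young_store):
    """old store vector AND (young store vector OR young load vector)."""
    return old_store & (young_store | young_load)


def check_combined(old_combined, young_combined):
    """AND between the combined vectors of the older and the younger thread."""
    return old_combined & young_combined


def check_alternative(old_store, young_load):
    """old store vector AND young load vector (alternative embodiment)."""
    return old_store & young_load


# Thread 5 (older): load vector 9a has bits 0, 1, 5; store vector 9b has bits 1, 4.
load_9a, store_9b = 0b00100011, 0b00010010
# Thread 6 (younger): load vector 10a has bits 2, 5, 6; store vector 10b has bits 1, 2, 6.
load_10a, store_10b = 0b01100100, 0b01000110

separate = check_separate(store_9b, load_10a, store_10b)
combined = check_combined(load_9a | store_9b, load_10a | store_10b)
alternative = check_alternative(store_9b, load_10a)

# Any set bit in a result means that a collision is reported.
collision_reported = {
    "separate": separate != 0,
    "combined": combined != 0,
    "alternative": alternative != 0,
}
```

With these inputs the separate-vector check reports bit 1, the combined check reports bits 1 and 5, and the alternative check reports nothing. In the example the two threads accessed no memory element in common, so the reported bits are spurious collisions caused by the many-to-one mapping: m1 and m9 share bit 1, and m5 and m13 share bit 5. The extra bit 5 in the combined check comes from two reads, which shows how the combined vector can increase the number of spurious collisions.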

An advantage of the collision check of the present invention is that since collisions do not have to be detected until the thread has finished executing, there is no need for any locking mechanism or memory barriers during execution. This reduces the execution overhead and makes the implementation simpler. Another reason why the execution overhead can be reduced according to the present invention is that if the collision check is only performed when the thread has finished executing, at most one check will have to be made for each accessed memory element, even if the element has been accessed many times during execution. In the known mechanisms discussed above, a collision check was performed in connection with each access to the shared memory.

The cost of handling collisions according to the present invention is that collisions are not detected as early as possible, which results in some wasted data processing of threads that already have collided and should be rolled back. However, the gain in execution overhead will in many cases surpass the cost of not detecting collisions immediately. The collision check of the present invention described above is thus particularly favorable when collisions are rare.

According to the present invention, the only thing that has to be performed in the same order as in the original single-threaded program is the collision check. Threads can be executed and rolled back out of program order and, depending on the implementation, sometimes also committed out of program order.

If the many-to-one correspondence between the memory elements and the bits in the load and store vectors is used, the load and store vectors can have a fixed size. The memory overhead is then proportional to the number of threads instead of the number of memory elements, which means that the amount of memory needed to store the data structures will remain the same when the number of memory elements in the shared memory increases.

The present invention can be implemented both in hardware and in software. In a hardware implementation it is possible to use a fast fixed-size memory inside each processor to store the data structures. In a software implementation a speed advantage will be obtained if the data structures are made small enough to be stored in the first level cache of the processor. Due to the frequent use of the data structures it will be advantageous to store them in as fast memory as possible.

The data structure associated with a thread will naturally only have to be stored in memory until the thread with which it is associated and all threads that may depend on the thread are committed. Once the thread and all threads that may depend on it are committed, the memory used to store its associated data structure can be reused.

The present invention is not limited to any particular type of memory elements of a shared memory. The present invention is applicable to both logical and physical memory elements. Logical memory elements are for example variables, vectors, structures and objects in an object oriented language. Physical memory elements are for example bytes, words, cache lines, memory pages and memory segments.

As described above, a thread comprises a number of program instructions. Other terms for a series of instructions are sometimes used in the field. An example of such a term is job.

Thread-level speculative execution with a shared memory has many similarities to a database transaction system. The entries of a database can be compared with the elements of a shared memory and, since a database transaction includes a number of operations, a database transaction can be compared with a thread. One way to ensure that a database remains consistent is to check for collisions between different database transactions. Thus the principles of the ideas of the present invention may be used also in this field.

It is to be understood that the embodiments of the present invention discussed above and illustrated in the figures merely serve as examples to illustrate the ideas of the present invention and that the invention in no way is limited to just the examples described. The examples are for instance simple examples that only illustrate a few memory elements in the shared memory and a few bits in the data structures associated with the threads. In reality the number of memory elements and bits can be very large. The present invention is further not limited to any particular number of threads or CPUs.

1. An apparatus that supports execution of computer program instructionsspeculatively out of program order comprising: a plurality of threadsfor executing computer program instructions, and a shared memory, whichcomprises a number of shared memory elements accessible to the pluralityof threads; wherein each of the threads is associated with a respectivedata structure for collision detection between the plurality of threads,said data structure being arranged to store information indicating whichof said number of shared memory elements the associated thread hasaccessed, and wherein each of the threads includes means for accessing aselected shared memory element in the shared memory, and means forstoring information in the associated data structure of the thread, saidinformation indicating the thread's access to the selected shared memoryelement.
 2. The apparatus according to claim 1, wherein the data structures are one of the following types of structures: an unsorted list, a sorted list, a tree and a table.
 3. The apparatus according to claim 1, wherein each data structure comprises a number of bits that correspond to the shared memory elements of the shared memory, and wherein the means for storing information include means for setting at least one chosen bit, corresponding to the selected shared memory element.
 4. The apparatus according to claim 3, wherein the data structure comprises a load vector and a store vector, wherein the means for setting at least one chosen bit is arranged to set a bit in the load vector when the first thread accesses the selected shared memory element in order to read it, and wherein the means for setting at least one chosen bit is arranged to set a bit in the store vector when the first thread accesses the selected shared memory element in order to write to it.
 5. The apparatus according to claim 3, wherein the data structure comprises a single combined load and store vector.
 6. The apparatus according to claim 4, wherein there is a one-to-one correspondence between the shared memory elements in the shared memory and the bits in the, or each, vector of the data structure.
 7. The apparatus according to claim 4, wherein there is a many-to-one correspondence between the shared memory elements in the shared memory and the bits in the, or each, vector of the data structure.
 8. The apparatus according to claim 7, wherein the correspondence between the bits in the, or each, vector and the shared memory elements is determined by a hash function that maps the shared memory elements to the bits in the, or each, vector.
 9. The apparatus according to claim 1, further comprising: means for checking whether a thread has a private copy of the selected shared memory element; means for creating a private copy of the selected shared memory element; and means for reading and writing to a private copy of the selected shared memory element.
 10. The apparatus according to claim 1, further comprising means for storing a backup copy of the selected shared memory element.
 11. The apparatus according to claim 1, further comprising: means for determining, when a first thread has finished execution, whether each of the threads on which the first thread may depend is ready to be committed; and means for checking for a collision between the first thread and each of the threads on which the first thread may depend, said means for checking including means for comparing the data structure associated with the first thread with each respective data structure associated with the threads on which the first thread may depend.
 12. The apparatus according to claim 11, further comprising means for creating a dependence list associated with the first thread before execution of the first thread, said dependence list including a reference to each thread which has not yet been committed and which comes before the first thread in program order.
 13. The apparatus according to claim 11, further comprising: means for committing the first thread if no collision is detected between the first thread and any of the threads on which the first thread may depend; and means for restarting execution of the first thread if a collision is detected between the first thread and any of the threads on which the first thread may depend.
 14. The apparatus according to claim 13, further comprising means for delaying a restart of execution of the first thread until the thread, or each of the threads, with which the first thread has collided has been committed.
 15. The apparatus according to claim 14, further comprising means for giving priority to committing and/or executing the thread, or each of the threads, with which the first thread has collided.
 16. The apparatus according to claim 11, further comprising means for performing a partial check for collisions between the first thread and at least one of the threads on which the first thread may depend, said means for performing a partial check including means for comparing the data structure associated with the first thread with the respective data structure associated with the at least one of the threads on which the first thread may depend.
 17. A method for recording information regarding accesses to a number of shared memory elements of a shared memory, said shared memory being accessible by a plurality of threads that are arranged to execute computer program instructions speculatively out of program order, wherein each of the threads is associated with a respective data structure for collision detection between the plurality of threads, said data structure being arranged to store information indicating which of said number of shared memory elements the associated thread has accessed, said method comprising the steps of: accessing, by a first of the plurality of threads, a selected shared memory element in the shared memory, and storing, by the first thread, information in the associated data structure of the first thread, said information indicating the first thread's access to the selected shared memory element.
 18. The method according to claim 17, wherein the data structure is one of the following types of structures: an unsorted list, a sorted list, a tree and a table.
 19. The method according to claim 17, wherein each data structure comprises a number of bits that correspond to the shared memory elements of the shared memory, and wherein the step of storing information includes setting a chosen bit in the data structure corresponding to the selected shared memory element.
 20. The method according to claim 19, wherein the data structure comprises a load vector and a store vector, wherein the chosen bit is a bit in the load vector if the first thread accesses the selected shared memory element in order to read it, and wherein the chosen bit is a bit in the store vector if the first thread accesses the selected shared memory element in order to write to it.
 21. The method according to claim 19, wherein the data structure comprises a single combined load and store vector.
 22. The method according to claim 20, wherein there is a one-to-one correspondence between the shared memory elements in the shared memory and the bits in the, or each, vector of the data structure.
 23. The method according to claim 20, wherein there is a many-to-one correspondence between the shared memory elements in the shared memory and the bits in the, or each, vector of the data structure.
 24. The method according to claim 23, wherein the correspondence between the bits in the, or each, vector and the shared memory elements is determined by means of mapping the memory elements to the bits in the, or each, vector using a hash function.
 25. The method according to claim 17, further comprising the steps of: determining by the first thread, whether the first thread has a private copy of the selected shared memory element; if the first thread has a private copy and the first thread accesses the selected shared memory element in order to read it, reading by the first thread from the private copy; if the first thread does not have a private copy and the first thread accesses the selected shared memory element in order to read it, reading by the first thread from the selected shared memory element in the shared memory; if the first thread has a private copy and the first thread accesses the selected shared memory element in order to write to it, writing by the first thread to the private copy; and if the first thread does not have a private copy and the first thread accesses the selected shared memory element in order to write to it, creating by the first thread a private copy of the selected shared memory element, and writing by the first thread to the private copy.
 26. The method according to claim 17, further comprising the steps of: if the first thread accesses the selected shared memory element in order to write to it, storing by the first thread, a backup copy of the selected shared memory element; and writing by the first thread to the selected shared memory element in the shared memory after the backup copy is stored.
 27. A method for handling possible collisions between a plurality of threads, said threads being arranged to execute computer program instructions speculatively out of program order and to access shared memory elements of a shared memory, said method comprising the steps of: executing a first thread; determining, when the first thread has finished execution, whether each of the threads on which the first thread may depend is ready to be committed; waiting until each of the threads on which the first thread may depend is ready to be committed, if each of the threads on which the first thread may depend is not ready to be committed; and checking for a collision between the first thread and each of the threads on which the first thread may depend by comparing a data structure associated with the first thread with a data structure associated with the thread on which the first thread may depend, said data structure storing information regarding which of the shared memory elements the thread with which the data structure is associated has accessed during execution of the thread.
 28. The method according to claim 27, wherein each data structure includes a number of bits that correspond to the shared memory elements of the shared memory, and wherein a bit is set if the shared memory element to which the bit corresponds has been accessed by the thread with which the data structure is associated during execution of the thread.
 29. The method according to claim 28, wherein each data structure includes a load vector and a store vector, wherein a bit in the load vector is set if the memory element to which the bit corresponds has been read by the thread with which the data structure is associated during execution of the thread, and wherein a bit in the store vector is set if the memory element to which the bit corresponds has been written to by the thread with which the data structure is associated during execution of the thread.
 30. The method according to claim 28, wherein each data structure comprises a single combined load and store vector.
 31. The method according to claim 27, further comprising creating a dependence list associated with the first thread before execution of the first thread, said dependence list including a reference to each thread which has not yet been committed and which comes before the first thread in program order.
 32. The method according to claim 27, wherein the first thread is committed if no collision is detected, and wherein the execution of the first thread is restarted if a collision is detected.
 33. The method according to claim 32, wherein the restart of execution of the first thread is delayed until the thread, or each of the threads, with which the first thread collided has been committed.
 34. The method according to claim 33, wherein priority is given to committing and/or executing the thread, or each of the threads, with which the first thread collided.
 35. The method according to claim 27, further comprising performing a partial check for collisions between the first thread and at least one of the threads on which the first thread may depend by comparing the data structure associated with the first thread with the respective data structure associated with the at least one of the threads on which the first thread may depend, wherein no locking of the data structures takes place while the partial check is performed.
 36-37. (Canceled)